Decision trees and Ensemble Methods

DDD: Elements of Statistical Machine Learning & Politics of Data

Ayush Patel

At Azim Premji University, Bhopal

15 Feb, 2026

Hello

I am Ayush.

I am a researcher working at the intersection of data, development and economics.

I am a Posit (formerly RStudio) Certified Tidyverse Instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.

Did you come prepared?

  • You have installed R. If not see this link.

  • You have installed RStudio/Positron/VS Code or another IDE. It is recommended that you work through an IDE.

  • You have the libraries {tree}, {gbm}, {randomForest} and {BART} installed

Learning Goals

  1. What are decision trees?
  2. When to use them?
  3. How do these work?
  4. Application and interpretation.
  5. Improving decision trees using ensemble methods.

Decision Trees


Supervised and non-parametric

Can be used for Classification and Regression

A predictor space is cut into segments, and the mean response of the training observations in each segment is used as the estimate of the response for test observations.

Simple and easy to interpret, but not the best for prediction accuracy on its own.

Prediction accuracy can be improved by ensemble methods.

Intuition - How it works

Can you identify regions with similar salary ranges?


Model output - Predictor Space

from ISLR

Model output - Tree

from ISLR

Region representation



\(R_1 = \left\{X \mid Years < 4.5\right\}\)

\(R_2 = \left\{X \mid Years \geq 4.5,\, Hits < 117.5\right\}\)

\(R_3 = \left\{X \mid Years \geq 4.5,\, Hits \geq 117.5\right\}\)
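As a sketch, membership in the three regions above can be computed with nested conditionals. The toy data frame below stands in for the Hitters data (the real data, with columns Years and Hits, is in {ISLR2}):

```r
# Toy stand-in for the Hitters data
players <- data.frame(
  Years = c(2, 6, 10, 3, 8),
  Hits  = c(80, 100, 150, 120, 130)
)

# Assign each observation to R1, R2 or R3 using the splits above
players$region <- ifelse(
  players$Years < 4.5, "R1",
  ifelse(players$Hits < 117.5, "R2", "R3")
)

players$region
#> [1] "R1" "R2" "R3" "R1" "R3"
```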

Tree terminology

from ISLR

How should we carry out segmenting of predictor space?

Segmenting - Theory

Predictor Space of \(p\) variables needs to be segmented into \(J\) different regions.

In theory, the regions could be of any shape; in practice, high-dimensional rectangles (boxes) are chosen for computational ease and interpretability.

For every observation in region \(R_j\), we make the same prediction: the mean (for regression) or mode (for classification) of the training observations in \(R_j\).

Minimize: \(\sum_{j=1}^{J}{\sum_{i \in R_j}{(y_i - \hat{y}_{R_j})^2}}\)
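In base R, the "same prediction per region" idea amounts to replacing each response by its region mean; a minimal sketch with made-up data and already-known region labels:

```r
# Toy responses and region assignments
y      <- c(3, 4, 5, 10, 12)
region <- c("R1", "R1", "R1", "R2", "R2")

# Mean response per region, then the fitted value for each observation
region_means <- tapply(y, region, mean)
fitted <- unname(region_means[region])
fitted
#> [1]  4  4  4 11 11
```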

But

It is not feasible to consider every possible partition: all possible cutpoints for all possible predictors in all possible sequences is far too many combinations.

We use high-dimensional rectangles instead of any shape for ease of interpretation.

So, use Recursive Binary splitting

top-down

greedy

top-down



We begin at the point where all observations are part of a single region, at the top of the tree, and split downward. Hence the name top-down.

Greedy



We take the best split at a particular step; we do not look ahead. A predictor \(X_j\) and a cutpoint \(s\) are chosen such that the resulting split leads to the lowest RSS.

This is then carried out recursively, over and over again, within each resulting region.

Formally

We aim to minimize the following at every step

\(\sum_{i:x_i \in R_1 (j,s)}{(y_i - \hat{y}_{R_1})^2} + \sum_{i:x_i \in R_2 (j,s)}{(y_i - \hat{y}_{R_2})^2}\)
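A single greedy step can be sketched directly from this objective: for one predictor, scan candidate cutpoints and keep the one with the lowest two-region RSS. The function and variable names below are illustrative, not from {tree}:

```r
# Find the cutpoint s for predictor x that minimizes the two-region RSS
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- head(xs, -1) + diff(xs) / 2  # midpoints between adjacent values
  rss <- sapply(cuts, function(s) {
    left  <- y[x < s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutpoint = cuts[which.min(rss)], rss = min(rss))
}

set.seed(1)
x <- runif(200)
y <- ifelse(x < 0.5, 10, 20) + rnorm(200)  # true break at 0.5
best_split(x, y)$cutpoint  # lands close to 0.5
```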

Fitting Regression Trees

from {palmerpenguins} website

Fitting Regression Trees

tree(body_mass_g ~ ., data = penguins) -> peng_mass_tree

peng_mass_tree
node), split, n, deviance, yval
      * denotes terminal node

1) root 333 215300000 4207  
  2) species: Adelie,Chinstrap 214  40430000 3715  
    4) sex: female 107   8493000 3419 *
    5) sex: male 107  13240000 4010 *
  3) species: Gentoo 119  29670000 5092  
    6) sex: female 58   4519000 4680 *
    7) sex: male 61   5884000 5485 *

Fitting Regression Trees

peng_mass_tree$frame
      var   n       dev     yval splits.cutleft splits.cutright
1 species 333 215259666 4207.057            :ab              :c
2     sex 214  40428633 3714.720             :a              :b
4  <leaf> 107   8493224 3419.159                               
5  <leaf> 107  13241192 4010.280                               
3     sex 119  29674443 5092.437             :a              :b
6  <leaf>  58   4519321 4679.741                               
7  <leaf>  61   5884098 5484.836                               

Fitting Regression Trees

peng_mass_tree$where
  1   2   3   5   6   7   8  13  14  15  16  17  18  19  20  21  22  23  24  25 
  4   3   3   3   4   3   4   3   4   4   3   3   4   3   4   3   4   3   4   4 
 26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45 
  3   4   3   3   4   3   4   3   4   3   4   4   3   3   4   3   4   3   4   3 
 46  47  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66 
  4   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
 67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86 
  3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
 87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 
  4   3   4   3   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 
  3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 
  3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 
  4   3   3   4   3   4   6   7   6   7   7   6   6   7   6   7   6   7   6   7 
167 168 169 170 171 172 173 174 175 176 177 178 180 181 182 183 184 185 186 187 
  6   7   6   7   6   7   7   6   6   7   6   7   7   6   7   7   6   6   7   6 
188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 
  7   6   7   6   7   6   7   6   7   7   6   6   7   6   7   6   7   6   7   6 
208 209 210 211 212 213 214 215 216 217 218 220 221 222 223 224 225 226 227 228 
  7   6   7   6   7   6   7   6   7   6   7   7   6   7   6   7   7   6   6   7 
229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 
  6   7   6   7   6   7   6   7   6   7   6   7   6   7   6   7   6   7   6   7 
249 250 251 252 253 254 255 256 258 259 260 261 262 263 264 265 266 267 268 270 
  7   6   6   7   6   7   6   7   7   6   7   6   7   6   7   6   7   6   7   7 
271 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 
  6   6   7   6   7   3   4   4   3   4   3   3   4   3   4   3   4   3   4   3 
292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 
  4   4   3   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   4 
312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 
  3   3   4   3   4   4   3   4   3   3   4   3   4   4   3   3   4   3   4   3 
332 333 334 335 336 337 338 339 340 341 342 343 344 
  4   3   4   4   3   4   3   3   4   3   4   4   3 

Fitting Regression Trees

summary(peng_mass_tree)

Regression tree:
tree(formula = body_mass_g ~ ., data = penguins)
Variables actually used in tree construction:
[1] "species" "sex"    
Number of terminal nodes:  4 
Residual mean deviance:  97680 = 32140000 / 329 
Distribution of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-760.30 -219.20   15.16    0.00  220.30  815.20 

Fitting Regression Trees

plot(peng_mass_tree)
text(peng_mass_tree, pretty = 0)

Fitting Regression Trees

na.omit(penguins) |>
    mutate(
        pred_mass = predict(peng_mass_tree)
    ) |>
    relocate(body_mass_g, pred_mass, everything())
# A tibble: 333 × 9
   body_mass_g pred_mass species island    bill_length_mm bill_depth_mm
         <int>     <dbl> <fct>   <fct>              <dbl>         <dbl>
 1        3750     4010. Adelie  Torgersen           39.1          18.7
 2        3800     3419. Adelie  Torgersen           39.5          17.4
 3        3250     3419. Adelie  Torgersen           40.3          18  
 4        3450     3419. Adelie  Torgersen           36.7          19.3
 5        3650     4010. Adelie  Torgersen           39.3          20.6
 6        3625     3419. Adelie  Torgersen           38.9          17.8
 7        4675     4010. Adelie  Torgersen           39.2          19.6
 8        3200     3419. Adelie  Torgersen           41.1          17.6
 9        3800     4010. Adelie  Torgersen           38.6          21.2
10        4400     4010. Adelie  Torgersen           34.6          21.1
# ℹ 323 more rows
# ℹ 3 more variables: flipper_length_mm <int>, sex <fct>, year <int>

Do it yourself


Run the linear model body_mass_g ~ species + sex using the penguins data.
Compare the RSS of the linear model with the RSS of the decision tree.

Can't just keep splitting


Some terminal nodes have very few observations.

node), split, n, deviance, yval
      * denotes terminal node

 1) root 263 53320000  535.9  
   2) CHits < 450 117  5931000  227.9  
     4) AtBat < 147 5  2940000  709.5 *
     5) AtBat > 147 112  1779000  206.4  
      10) CRBI < 114.5 74   302100  141.8 *
      11) CRBI > 114.5 38   567200  332.1 *
   3) CHits > 450 146 27390000  782.8  
     6) Walks < 61 104  9470000  649.6  
      12) AtBat < 395.5 53  2859000  510.0 *
      13) AtBat > 395.5 51  4504000  794.7  
        26) PutOuts < 771 45  2358000  746.4 *
        27) PutOuts > 771 6  1255000 1157.0 *
     7) Walks > 61 42 11500000 1113.0  
      14) RBI < 73.5 22  3148000  885.3  
        28) PutOuts < 239.5 7  1739000 1156.0 *
        29) PutOuts > 239.5 15   656300  758.9 *
      15) RBI > 73.5 20  5967000 1363.0  
        30) Years < 13.5 14  3767000 1521.0  
          60) CAtBat < 3814.5 8   529600 1141.0 *
          61) CAtBat > 3814.5 6   541500 2028.0 *
        31) Years > 13.5 6  1026000  992.5 *

Can't just keep splitting

So, when to stop?


The recursive splitting approach is likely to overfit,

leading to an overly complex tree.

A relatively smaller tree can avoid overfitting via lower variance, providing better interpretation at the cost of a little bias.

One could say that we split only if the reduction in RSS is larger than some threshold, but this can be short-sighted: a seemingly worthless split may be followed by a very good one.

So what to do


Grow the full tree.

Prune it back to a smaller subtree.

But which is the best subtree?

Intuitively, the one that leads to the lowest test error rate.

Finding the best subtree


We can use cross-validation to estimate the test error rate.

But there can be very many subtrees.

So, we select a small sequence of subtrees using cost complexity pruning, also known as weakest link pruning.

Cost Complexity Pruning



For each value of the tuning parameter \(\alpha\), we seek the subtree \(T \subset T_0\) (the full tree) that minimizes

\(\sum_{m=1}^{|T|}{\sum_{i:x_i \in R_m}{(y_i - \hat{y}_{R_m})^2}} + \alpha |T|\)

where \(|T|\) is the number of terminal nodes of \(T\).
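This criterion can be evaluated by hand from a fitted {tree} object's frame, which stores per-node deviance as shown earlier. cost_complexity below is an illustrative helper, not part of {tree}:

```r
# Penalized cost of a tree: total terminal-node deviance plus alpha * |T|
cost_complexity <- function(fit, alpha) {
  leaves <- fit$frame[fit$frame$var == "<leaf>", ]
  sum(leaves$dev) + alpha * nrow(leaves)
}

# e.g. on the penguin tree fitted earlier (assuming {tree} is loaded):
# cost_complexity(peng_mass_tree, alpha = 1000)
```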

The Combined Algorithm

  1. Use Recursive binary splitting to grow a large tree.
  2. Apply cost complexity pruning to get a sequence of best subtrees, as a function of \(\alpha\)
  3. Use K-fold CV to choose \(\alpha\).
    1. Repeat steps 1 and 2 on all but the kth fold of the training data.
    2. Evaluate the MSE on the left-out kth fold, as a function of \(\alpha\). Average the results for each value of \(\alpha\) and choose the one that minimizes the average error.
  4. Return the subtree from step 2 that corresponds to the chosen value of \(\alpha\).

Step 1 - Grow the full tree

rsample::initial_split(data = Hitters,prop = .75) -> hit_split
rsample::training(hit_split) -> train_tree
rsample::testing(hit_split) -> test_tree

tree(Salary ~ ., data = train_tree) -> sal_hit_tree

Step 2 - Apply Cost Complexity Pruning

prune.tree(sal_hit_tree)
$size
[1] 9 8 7 6 5 4 3 2 1

$dev
[1] 11072981 11465432 12199624 13012825 14118319 15319451 17139319 22660425
[9] 37306653

$k
[1]       -Inf   392451.4   734192.4   813200.9  1105493.6  1201132.1  1819867.6
[8]  5521106.3 14646227.7

$method
[1] "deviance"

attr(,"class")
[1] "prune"         "tree.sequence"

Step 3 - Apply CV to choose \(\alpha\)

cv.tree(sal_hit_tree,FUN = prune.tree) -> cv_pruned_estimates

cv_pruned_estimates
$size
[1] 9 8 7 6 5 4 3 2 1

$dev
[1] 20512046 19887524 20466514 20910349 20876275 20264130 20031925 25159233
[9] 38125407

$k
[1]       -Inf   392451.4   734192.4   813200.9  1105493.6  1201132.1  1819867.6
[8]  5521106.3 14646227.7

$method
[1] "deviance"

attr(,"class")
[1] "prune"         "tree.sequence"

Step 3 - Apply CV to choose \(\alpha\)

tibble(
    leaves = cv_pruned_estimates$size,
    rss = cv_pruned_estimates$dev
) |>
    ggplot(aes(leaves, rss)) +
    geom_point()+
    geom_line()+
    theme_minimal()

Step 3 - Apply CV to choose \(\alpha\)

Step 4 - get the subtree from the large tree

prune.tree(sal_hit_tree, best = 3) -> best_sub_tree_sal

plot(best_sub_tree_sal)
text(best_sub_tree_sal, pretty = 0)

Let's predict using the model

predict(best_sub_tree_sal, test_tree)
      -Alvin Davis      -Alex Trevino     -Andy VanSlyke         -Bob Boone 
         1526.5449           708.1354           232.8424           232.8424 
     -Bill Madlock     -Bobby Meacham  -BillyJo Robidoux    -Bill Schroeder 
          708.1354           232.8424           232.8424           232.8424 
    -Butch Wynegar   -Carmen Castillo     -Cliff Johnson    -Curt Wilkerson 
          708.1354           232.8424           232.8424           232.8424 
    -Dave Anderson        -Don Baylor      -Daryl Boston      -Dave Collins 
          232.8424           708.1354           232.8424           708.1354 
  -Dave Concepcion     -Darrell Evans     -Damaso Garcia       -Dale Murphy 
          708.1354           708.1354           708.1354           708.1354 
  -Danny Tartabull       -Dickie Thon     -Denny Walling       -Enos Cabell 
          232.8424           708.1354           708.1354           708.1354 
        -Ed Romero       -Ernie Whitt         -Fred Lynn   -George Hendrick 
          232.8424           708.1354           708.1354           708.1354 
       -Garth Iorg     -Gary Matthews     -Graig Nettles   -Garry Templeton 
          708.1354           708.1354           708.1354           708.1354 
       -Joe Carter         -Jim Dwyer       -Jack Howell   -Jeffrey Leonard 
          232.8424           708.1354           232.8424           708.1354 
       -Jorge Orta          -Jim Rice      -Jim Sundberg        -Jim Traber 
          708.1354           708.1354           708.1354           232.8424 
       -Jose Uribe   -Joel Youngblood       -Kirk Gibson        -Ken Phelps 
          232.8424           708.1354           708.1354           232.8424 
      -Len Dykstra     -Larry Herndon     -Lance Parrish       -Luis Rivera 
          232.8424           708.1354           708.1354           232.8424 
     -Lonnie Smith        -Mike Brown       -Mike Easler   -Mike Pagliarulo 
          708.1354           232.8424           708.1354           232.8424 
      -Ozzie Smith       -Phil Garner   -Rafael Belliard      -Rick Dempsey 
          708.1354           708.1354           232.8424           708.1354 
 -Rickey Henderson        -Ron Kittle     -Randy Kutcher        -Rick Leach 
          708.1354           232.8424           232.8424           232.8424 
     -Rick Manning   -Rance Mulliniks      -Rey Quinones    -Rafael Ramirez 
          708.1354           708.1354           232.8424           708.1354 
    -Ryne Sandberg       -Roy Smalley    -Robby Thompson    -Steve Buechele 
          708.1354           708.1354           232.8424           232.8424 
   -Shawon Dunston      -Steve Garvey -Steve Lombardozzi    -Tony Fernandez 
          232.8424          1526.5449           232.8424           708.1354 
       -Tony Gwynn       -Toby Harrah       -Ted Simmons     -Wally Backman 
          708.1354           708.1354           708.1354           708.1354 
       -Wade Boggs      -Wally Joyner  -Wayne Krenchicki   -Willie Randolph 
          708.1354           232.8424           232.8424           708.1354 
    -Willie Upshaw 
         1526.5449 
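To judge these predictions, compare them with the held-out salaries via the test MSE; test_mse below is a tiny illustrative helper (not from {tree}):

```r
# Test-set mean squared error between actual and predicted responses
test_mse <- function(actual, predicted) {
  mean((actual - predicted)^2, na.rm = TRUE)
}

# With the objects from the steps above (Hitters test split, pruned tree):
# test_mse(test_tree$Salary, predict(best_sub_tree_sal, test_tree))
```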

Do it yourself

For the Boston data set from {ISLR2}, run a decision tree model that estimates the median value of owner-occupied homes in $1000s. Complete the full algorithm: grow a large tree and choose an \(\alpha\) for the best possible subtree.

Classification Trees

Response is qualitative

For making predictions, instead of using the mean of the training observations in a region, the most commonly occurring class of training observations in that region is used.

Classification trees are grown in a similar fashion to regression trees. But instead of RSS, we use the classification error rate.

Classification Error Rate (E): The fraction of training observations in a region that do not belong to the most common class.

Classification Error Rate


\[E = 1 - \max_{k}(\hat{p}_{mk})\]

Here \(\hat{p}_{mk}\) is the proportion of training observations in the \(m\)th region that are of class \(k\).

But, this is not sufficiently sensitive for tree growing, so we use other measures in practice.

Classification Error Rate - Issues

flowchart 
    A[A50,B50] --> B[A10,B40]
    A[A50,B50] --> C[A40,B10]

flowchart 
    A[A50,B50] --> B[A50,B20]
    A[A50,B50] --> C[A0,B30]

Classification Error Rate - Issues

Does not favour “pure” nodes

Only cares about the most common class, ignoring by how much it dominates

May result in a premature stop in tree growing

Insensitive to minor changes in class proportions

Alternative - Gini



\(G = \sum_{k=1}^{K}{\hat{p}_{mk}(1-\hat{p}_{mk})}\)

The Value of G will be small when node purity is high.

Alternative - Entropy



\(D = - \sum_{k=1}^{K}{\hat{p}_{mk}\log\hat{p}_{mk}}\)

The value of \(D\) will be near zero when each \(\hat{p}_{mk}\) is near zero or one, i.e. when the node is nearly pure.
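All three node measures can be computed side by side for a vector of class proportions; node_impurity is an illustrative base-R sketch (using the usual convention \(0 \log 0 = 0\)):

```r
# E, G and D for one node, given class proportions p (summing to 1)
node_impurity <- function(p) {
  c(
    error   = 1 - max(p),                         # classification error rate
    gini    = sum(p * (1 - p)),                   # Gini index
    entropy = -sum(ifelse(p > 0, p * log(p), 0))  # entropy, with 0*log(0) := 0
  )
}

node_impurity(c(0.5, 0.5))  # maximally impure: E = 0.5, G = 0.5, D = log(2)
node_impurity(c(1, 0))      # pure node: all three measures are 0
```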

Classification Error Rate - Issues

tibble::tibble(
  prop_a = seq(0.001, 0.999, by = 0.001),  # avoid log(0) at the endpoints
  prop_b = 1 - prop_a,
  max = ifelse(prop_a >= prop_b, prop_a, prop_b),
  E = 1 - max,
  G = 2 * (prop_a * prop_b),
  En = -1 * ((prop_a * log(prop_a)) + (prop_b * log(prop_b)))
) |>
  ggplot2::ggplot(ggplot2::aes(prop_a, E)) +
  ggplot2::geom_line() +
  ggplot2::geom_line(ggplot2::aes(y = G), colour = "green") +
  ggplot2::geom_line(ggplot2::aes(y = En), colour = "steelblue") +
  ggplot2::theme_minimal() +
  ggplot2::labs(
    x = "probability of class A at a given node",
    y = "Y",
    title = "Comparing Classification Error Rate, Gini and Entropy",
    subtitle = "Toy example with two classes: A and B"
  )

Classification Error Rate - Issues

A comment on E, G and D



Best to use G or D for splitting

For pruning and gauging overall accuracy E is preferable if prediction accuracy is the goal.

The {tree} package takes care of all this by default

Code reference

tree(League ~ ., data = Hitters) -> mod

prune.tree(mod, method = "misclass")

cv.tree(mod, FUN = prune.misclass)

Do it yourself

This link has the Heart data for patients with chest pains. The variable AHD is a binary variable where Yes refers to existing heart disease. Create a decision tree model, a full-grown tree, to understand factors for predicting AHD. Also, prune this tree by choosing the appropriate \(\alpha\). Write an interpretation note for your model.
Do your final and full-grown trees have some splits that estimate the same response for terminal nodes? Why does that happen?